Abstract: Rapid analysis of DNA sequences is important in 
preventing the evolution of different viruses and bacteria during 
an early phase, early diagnosis of genetic predispositions to 
certain diseases (cancer, cardiovascular diseases), and in DNA 
forensics. However, real-world DNA sequences may comprise 
several Gigabytes and the process of DNA analysis demands 
adequate computational resources to be completed within a 
reasonable time. In this paper we present a scalable approach 
for parallel DNA analysis that is based on Finite Automata, and 
which is suitable for analyzing very large DNA segments. We 
evaluate our approach for real-world DNA segments of mouse 
(2.7GB), cat (2.4GB), dog (2.4GB), chicken (1GB), human (3.2GB) 
and turkey (0.2GB). Experimental results on a dual-socket 
shared-memory system with 24 physical cores show speedups 
of up to 17.6x. Our approach is up to 3x faster than a 
pattern-based parallel approach that uses the RE2 library. 

Index Terms —parallel DNA analysis, multi-core architectures, 
finite automata 

I. Introduction 

The need for high performance computational biology has 
emerged as a result of fast growth in biological information, 
the complexity of interactions that underlie many processes in 
biology, as well as the diversity and the interconnectedness 
of organisms at the molecular level. This biological 
information is accumulated via different techniques; however, 
it requires adequate analysis and processing to extract useful 
information and make the results evident. 

According to Benson et al., the number of Deoxyribonucleic 
Acid (DNA) sequences and nucleotide bases in these 
sequences is growing exponentially, doubling every 18 months. 
As these data are collected, motif search and DNA sequencing 
are just two examples of the many analysis tasks in Next 
Generation Sequencing. 

A DNA sequence contains specific genetic instructions, 
which make living organisms function properly. A DNA 
strand is built from four nucleotide bases: adenine (A), 
cytosine (C), guanine (G) and thymine (T). DNA analysis is 
important for discovering differences and similarities between 
organisms and for exploring the evolutionary relationships 
between them. This process often requires comparisons of the 
corresponding DNA sequences, for example, checking whether 
one sequence is a subsequence of another, or comparing the 
occurrences of specific k-mers in the corresponding DNA 
sequences. In computational biology, k-mers refer to all the 
possible substrings (sub-sequences) of length k of a DNA 
sequence. They have an important role during sequence 
assembly and can be used in sequence alignment as well. 
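
To make the notion of k-mers concrete, the following minimal Python sketch enumerates and counts the k-mers of a short sequence. The function names `kmers` and `count_kmers` and the example sequence are illustrative assumptions; they are not part of the algorithm presented in this paper.

```python
from collections import Counter


def kmers(sequence, k):
    """Return every length-k substring of `sequence`, ordered by position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_kmers(sequence, k):
    """Count how often each k-mer occurs (overlapping occurrences included)."""
    return Counter(kmers(sequence, k))


example_sequence = "acgtacg"          # hypothetical toy input
example_3mers = kmers(example_sequence, 3)
example_counts = count_kmers(example_sequence, 3)
```

A sequence of length n has n - k + 1 k-mers, so a sequence that is shorter than k has none.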

Analyzing DNA sequences within a reasonable time is 
important for domain scientists who study various phenomena, 
such as the evolution of viruses and bacteria during an early 
phase, or the diagnosis of genetic predispositions to certain 
diseases. 

Modern parallel computing systems promise to provide 
the capabilities to cope with the processing requirements 
of DNA analysis. Existing approaches use both hardware and 
software to accelerate regular expression matching. The 
hardware-based approaches are faster, but less flexible and 
more expensive, whereas software-based acceleration 
techniques are flexible in terms of updating or adding new 
patterns [30]. Recently, different software-based DNA analysis 
techniques designed for multi-core systems have been 
proposed [16], [19], [23]. 

In this paper, we first explore and discuss the 
parallelization opportunities of DNA analysis, and thereafter 
we introduce a parallel algorithm for DNA analysis that is 
based on Finite Automata. We use a domain decomposition 
approach for parallelization: the DNA sequence is split into 
several chunks, and each chunk is assigned to a thread that 
performs pattern matching. Our algorithm is optimized to 
speculate efficiently on the possible initial states for each 
chunk. Only one regular expression matching (REM) per chunk 
has to be performed completely; the remaining REMs stop 
when the converging point is reached. A converging point is 
a state where two or more REMs starting from different states 
meet after the same number of symbols has been read. 
Furthermore, we use a memory-efficient data structure that 
saves the information necessary to count and highlight the 
k-mers. Experiments with real-world DNA segments (for 
human and various animals) on a dual-socket shared-memory 
system with 48 threads show significant speedups compared 
to the sequential version (up to 17.6x). The implementation of 
our algorithm is up to 3x faster than a pattern-based algorithm 
implemented using the RE2 library. Major contributions of 
this paper include: 

• a parallel algorithm for DNA analysis that is based on 
Finite Automata; 

• empirical evaluation of our algorithm with real-world 
DNA segments of mouse (2.7GB), cat (2.4GB), dog 
(2.4GB), chicken (1GB), human (3.2GB) and turkey 
(0.2GB); 

• a comparison of our algorithm with a pattern-based 
algorithm implementation that uses the RE2 library. 


The rest of the paper is organized as follows. Section II 
provides background information on pattern matching, 
whereas Section III presents our algorithm for counting and 
extracting k-mers in a DNA sequence. Section IV presents the 
experimental setup and discusses the experimental results. The 
work described in this paper is compared and contrasted with 
the related work in Section V. Section VI provides a summary 
of our work. 


II. Regular Expression Matching (REM) with 
Finite Automata (FA) 

Regular expression matching verifies whether a pattern is 
present in a string. REM is commonly used for determining the 
locations of a pattern within a sequence of tokens, in search 
and replace functions, or to highlight important information 
out of a huge data set. In the context of computational 
biology, pattern matching is used for analyzing and processing 
biological information in order to extract the useful parts of 
the data and make them evident. The formal definition of 
REM is as follows: the input text is an array $T[1..n]$, where 
$n$ is the length of the input, and the pattern is an array 
$P[1..m]$, where the length of the pattern satisfies $m < n$. 
The alphabet $\Sigma$ defines the possible characters of the 
input string. 

A Finite Automaton (FA) is a machine for processing 
information by scanning the input text $T$ in order to find the 
occurrences of the pattern $P$. Formally, an FA is a quintuple 
$(Q, \Sigma, \delta, q_0, F)$, where $Q$ is the finite set of 
states, $\Sigma$ is the finite alphabet, 
$\delta : Q \times \Sigma \to Q$ is the transition function, 
$q_0$ is the start state and $F$ is the distinguished set of 
final states. 
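
The quintuple can be written down directly as a small data structure. The sketch below is only an illustration under stated assumptions: the class name `DFA`, the helper `count_matches` and the three-state automaton that recognizes the toy pattern "ac" are hypothetical and are not the automaton used in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DFA:
    states: frozenset      # Q
    alphabet: frozenset    # Sigma
    delta: dict            # transition function: (state, symbol) -> state
    start: int             # q0
    finals: frozenset      # F


def count_matches(dfa, text):
    """Scan `text` once and count how often a final state is entered."""
    state = dfa.start
    hits = 0
    for symbol in text:
        state = dfa.delta[(state, symbol)]
        if state in dfa.finals:
            hits += 1
    return hits


# Hand-built toy automaton for the single pattern "ac" over {a, c, g, t}.
# State 0: nothing matched, state 1: "a" matched, state 2: "ac" matched.
SIGMA = "acgt"
delta = {(q, s): (1 if s == "a" else 0) for q in (0, 1, 2) for s in SIGMA}
delta[(1, "c")] = 2
AC_DFA = DFA(frozenset({0, 1, 2}), frozenset(SIGMA), delta, 0, frozenset({2}))
```

Because every (state, symbol) pair has exactly one successor, the scan performs one table lookup per input symbol.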

A well-known algorithm for multiple pattern matching is the 
Aho-Corasick algorithm. It matches all occurrences 
(including overlapping ones) of multiple patterns in time 
linear in the size of the input string, examining each 
character of the input string only once. It builds an automaton 
by creating states and the transitions between them. It adds 
failure transitions for the case that no regular transition 
leaves the current state on a particular character, which makes 
it possible to match multiple and overlapping occurrences of 
the patterns. Furthermore, this algorithm is capable of 
delivering input-independent performance if implemented 
efficiently on parallel systems, which is why we use it as the 
basis of our work. 
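
For readers unfamiliar with the algorithm, the following Python sketch shows a textbook formulation of Aho-Corasick with failure transitions. It is an assumption-laden illustration, not the implementation evaluated in this paper; the names `build_aho_corasick` and `search` and the example patterns are chosen here for illustration only.

```python
from collections import deque


def build_aho_corasick(patterns):
    """Build the goto trie, the failure links and the output sets."""
    goto = [{}]    # goto[state][symbol] -> next state
    fail = [0]     # failure link of each state
    out = [[]]     # patterns that end when a state is entered
    for pattern in patterns:
        state = 0
        for symbol in pattern:
            if symbol not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]
        out[state].append(pattern)
    # Breadth-first traversal: failure links of shallower states come first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for symbol, child in goto[state].items():
            f = fail[state]
            while f and symbol not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(symbol, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def search(text, patterns):
    """Return (start_index, pattern) for every occurrence, overlaps included."""
    goto, fail, out = build_aho_corasick(patterns)
    state = 0
    found = []
    for i, symbol in enumerate(text):
        while state and symbol not in goto[state]:
            state = fail[state]          # follow failure transitions
        state = goto[state].get(symbol, 0)
        for pattern in out[state]:
            found.append((i - len(pattern) + 1, pattern))
    return found
```

Note that the inner `while` loop may follow several failure transitions for a single input symbol, which is the behaviour that a fully expanded DFA avoids.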


III. Design and Implementation of a DNA Analysis 
Algorithm 

In this section we first provide the details about the outline 
of our algorithm. Thereafter we discuss the most important 
implementation aspects to achieve a scalable algorithm for 
counting and extracting specific k-mers from a large DNA 
sequence. 


A. Our algorithm for counting and extracting the location of 
k-mers (k-mers CoEx) 


Figure 1 depicts two possible ways for parallel execution of 
regular expression matching for bio-computing applications: 

(a) the input-based approach, which splits the input string into 
smaller chunks and processes them in separate threads, and 

(b) the pattern-based approach, which splits the patterns into 
sub-patterns, creates a separate state machine (sub-DFA) for 
each of them and processes the same input string with each 
machine. 

Fig. 1. Load balancing using Input and Pattern partitioning approach: (a) Input-based approach, (b) Pattern-based approach. 


Our algorithm uses the input-based approach. The challenge 
of this approach is determining the initial state for each chunk. 
Finding the correct starting state for each chunk is important 
for finding the occurrences of the patterns that cross a 
chunk border. Other researchers use different ways of 
finding the initial states: for instance, Luchaup et al. 
use speculation to find the initial state based on the most 
visited states, Devi and Rajagopalan use an index-based 
technique, Chacon et al. use suffix arrays, and Villa et al. 
overlap the chunks by the pattern length. 
Fig. 2. K-mers CoEx finite state automaton for matching four example patterns, among them "cat", "cta" and "tta". 


Our way of determining the possible initial states is as 
follows: (1) find the set of source states ($L$) for the first 
element of the sub-input mapped to the running thread ($T_i$); 
(2) find the set of destination states ($S$) for the last character 
of the sub-input mapped to the previous thread ($T_{i-1}$); (3) 
find the intersection of $S$ and $L$ ($S \cap L$), which is the 
set of possible initial states. The first thread ($T_0$) always 
starts from the initial state $q_0$. Each thread is responsible for 
finding its set of possible initial states, and for each state 
of this set a regular expression matching is performed. When 
all threads have finished their job, the results are joined by a 
binary reduction, which connects the last visited state of $T_i$ 
to the first visited state of $T_{i+1}$. 
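
The three steps for determining the possible initial states can be sketched as follows. The sketch assumes that the transition table is a list of dictionaries indexed by state id; the function name `possible_starting_states` and both example tables are hypothetical and serve only to illustrate the set intersection.

```python
def possible_starting_states(dfa, first_char, last_char):
    """Return the intersection of the source and destination state sets.

    sources:      states that have a transition on the first symbol of the
                  chunk mapped to the current thread
    destinations: states that can be reached on the last symbol of the
                  chunk mapped to the previous thread
    """
    sources = {q for q, row in enumerate(dfa) if first_char in row}
    destinations = {row[last_char] for row in dfa if last_char in row}
    return sources & destinations


# Assumed toy table for the pattern "ac" with a transition for every symbol.
DENSE = [
    {"a": 1, "c": 0, "g": 0, "t": 0},
    {"a": 1, "c": 2, "g": 0, "t": 0},
    {"a": 1, "c": 0, "g": 0, "t": 0},
]

# Assumed sparse table: missing entries mean "no transition".
SPARSE = [{"a": 1}, {"c": 2}, {}]
```

In a table that defines a transition for every symbol, every state is a source state, so the result is determined by the destination states alone; the intersection only prunes further when the table is sparse.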

This method provides very good results for sparse transition 
tables, and good performance for dense tables when the DFA 
has a relatively small number of states. However, for a DFA 
with a large number of states this method appears to be less 
efficient, because one thread may have to perform multiple 
REMs on the same input, one for each possible initial state. 
To reduce the number of operations that each thread needs to 
perform the REMs starting from different states, further 
optimizations are needed. 

While investigating the REM using the modified 
Aho-Corasick DFA representation, we noticed that the results 
converge after several symbols are read (in our experiments, 
at most 10 steps were required to find the converging point). 
A converging point is a state where two or more REMs starting 
from different states meet after the same number of symbols 
has been examined. This insight allows us to significantly 
reduce the execution cost of performing the REM starting from 
each possible initial state. The details of the convergence 
process are given in Section III-B. 
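
The idea of a converging point can be illustrated with a small Python sketch. The three-state table (for the toy pattern "ac") and the function name `convergence_step` are assumptions made for this example; the automaton of the paper is much larger.

```python
# Assumed toy transition table with a transition for every symbol.
TABLE = [
    {"a": 1, "c": 0, "g": 0, "t": 0},
    {"a": 1, "c": 2, "g": 0, "t": 0},
    {"a": 1, "c": 0, "g": 0, "t": 0},
]


def convergence_step(table, text, start_a, start_b):
    """Return the number of symbols read until both runs are in the same
    state, or None if they do not meet within `text`."""
    a, b = start_a, start_b
    for step, symbol in enumerate(text, start=1):
        a = table[a][symbol]
        b = table[b][symbol]
        if a == b:
            return step
    return None
```

Once two runs are in the same state after the same number of symbols, they stay identical for the rest of the input, so only one of them has to be continued.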

B. Implementation Aspects 

In this section we explain the implementation details 
of our algorithm, including the process of building the DFA, 
splitting the input among the available threads, finding the set 
of possible initial states for each chunk assigned to a thread, 
running the REM for the first state in this set, and finding 
the converging point. For each process we refer to the 
corresponding lines of code in Algorithm 1. 

Furthermore, we define $q_i$ as the state of the automaton 
where $i$ is the state id, $E_i$ as a sub-pattern of the selected 
patterns, and $R_i$ as the process of performing an REM starting 
from the item at index $i$ of the set of possible initial states. 
The values used in the switch-case (Lines 25-30), e.g. 122, 
127, 128, ..., determine that a specific sub-pattern ($E_i$) has 
been matched. 

1) Building the Deterministic Finite Automaton: The AC 
algorithm with failure transitions has a drawback due to the 
non-deterministic transitions for a single input character. 
Figure 2 illustrates our solution, which eliminates the failure 
transitions by adding the appropriate transition (indicated by 
dashed lines) for each state. Having a valid transition to 
another state of the automaton for each possible character 
guarantees that the same number of operations is performed 
for each symbol. The example automaton shown in Figure 2 is 
able to match four patterns: a three-symbol pattern starting 
with "ac", as well as "cat", "cta" and "tta". For example, if we 
read the string "ac" we reach state $q_2$, and when "a", "c" or 
"t" is read next we know exactly which state follows (for 
instance, $q_{10}$ for "t"). 
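
One common way to remove failure transitions is to expand them into the transition table while the automaton is built, so that every state has exactly one successor per symbol. The sketch below follows that standard construction; it is an assumption about how such a table can be produced, not a description of the PaREM tool, and the names `build_dfa` and `count_matches` are hypothetical. It uses the three patterns of the example that are legible in the text.

```python
from collections import deque

ALPHABET = "acgt"


def build_dfa(patterns, alphabet=ALPHABET):
    """Return (delta, hits): a complete transition table and, per state,
    the number of patterns that end when that state is entered."""
    trie = [{}]
    hits = [0]
    for pattern in patterns:
        state = 0
        for symbol in pattern:
            if symbol not in trie[state]:
                trie.append({})
                hits.append(0)
                trie[state][symbol] = len(trie) - 1
            state = trie[state][symbol]
        hits[state] += 1
    delta = [dict.fromkeys(alphabet, 0) for _ in trie]
    fail = [0] * len(trie)
    queue = deque()
    for symbol in alphabet:
        if symbol in trie[0]:
            delta[0][symbol] = trie[0][symbol]
            queue.append(trie[0][symbol])
    while queue:                      # breadth-first: shallow states first
        state = queue.popleft()
        hits[state] += hits[fail[state]]
        for symbol in alphabet:
            if symbol in trie[state]:
                child = trie[state][symbol]
                fail[child] = delta[fail[state]][symbol]
                delta[state][symbol] = child
                queue.append(child)
            else:
                # Pre-computed replacement for the failure transition.
                delta[state][symbol] = delta[fail[state]][symbol]
    return delta, hits


def count_matches(delta, hits, text, start=0):
    """One table lookup per symbol; overlapping matches are counted."""
    state, total = start, 0
    for symbol in text:
        state = delta[state][symbol]
        total += hits[state]
    return total


DELTA, HITS = build_dfa(["cat", "cta", "tta"])
```

Since the failure links are resolved at construction time, the scan does the same amount of work for every input symbol, at the cost of a denser transition table.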

2) Splitting the input and finding the possible starting states 
(PSS): The process of splitting the input among the available 
threads is depicted in Table Ia. This is a straightforward step, 
where the input length is divided by the number of available 
threads (Lines 5-6). The chunks are assigned to the threads 
consecutively based on the thread IDs. 
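
A minimal sketch of the splitting step is given below. The function name `split_into_chunks` is hypothetical, and the handling of the remainder (the last chunk absorbs the leftover symbols) is an assumption, because the pseudo-code does not state how an input length that is not divisible by the number of threads is treated.

```python
def split_into_chunks(length, n_threads):
    """Return one (start, end) pair per thread id, end exclusive."""
    size = length // n_threads
    bounds = []
    for thread_id in range(n_threads):
        start = thread_id * size
        # Assumption: the last thread also takes the remaining symbols.
        end = length if thread_id == n_threads - 1 else start + size
        bounds.append((start, end))
    return bounds
```

For an input of 20 symbols and two threads this yields two chunks of ten symbols each, as in the example of Table I.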

The pseudo-code of the process of finding the PSS (see 
Section III-A) is shown in Algorithm 1 (Lines 49-60). When the 
PSS are determined, the thread performs an REM starting from 
each item of the PSS (Line 14). Tables Ib and Ic depict the REM 
process starting from each item of the PSS; each table 
corresponds to a thread. For example, the first thread (see 
Table Ib) initiates an REM from each of the states in its PSS. 

For every $R_i$ (REM starting from the PSS at index $i$) a 
CR structure is created and stored in the results. The CR 
structure stores the initial state, the last state and the total 
number of occurrences for each of the sub-patterns 
(Lines 61-65). 

3) Determining the Converging Point: We observed that 
while performing REMs of the same input string starting 
from different states, the REMs converge after a certain 
number of steps. An example of how the convergence happens 
is depicted in Tables Ib and Ic. For example, in Table Ic, the 
matching of CCCTATACGA... (see Table Ia) starting from $q_3$ 
leads to the following transitions: 
$q_2 \to q_2 \to q_2 \to q_{10} \to q_{11} \to q_{130} \to q_{40} \to q_{60} \to q_{15} \to q_1$.... 
Matching the same input starting from $q_{10}$ leads to the 
following transitions: $q_{19} \to q_2$; no further transitions are 
needed, because the state at position two for both REMs 
($R_0$ and $R_1$) is $q_2$, which indicates the point of 
convergence. 














Algorithm 1 k-mers CoEx 

Input: transition table dfa; set of final states F; input string I 
Output: list of CR results 
 1: procedure KCOEX(dfa, F, I) 
 2:   steps = 10                      ▷ The max number of steps required to converge 
 3:   results = list<list<CR>>        ▷ Stores the final states for each thread 
 4:   for T_0...T_n do in parallel    ▷ T - thread, n - total number of threads 
 5:     start_position = t_i * (I.length / n)             ▷ t_i = thread_id 
 6:     SI = substring(start_position, I.length / n)      ▷ SI - sub input 
 7:     if t_i != 0 then 
 8:       PSS = get_possible_starting_states(dfa, I[start_position], I[start_position - 1]) 
 9:     else 
10:       PSS = [0] 
11:     end if 
12:     fr_list = list<FR> 
13:     psi_i = 0                     ▷ index of the current PSS item, advanced after each REM 
14:     for int cs in PSS do 
15:       CR cr                       ▷ stores the init state, last state, and total number of occurrences for each subexpression 
16:       char_i = 0                  ▷ index of the current character, advanced after each character 
17:       for char c in SI do 
18:         if char_i == 0 then 
19:           cr.init_state = cs 
20:         else if char_i == SI.length - 1 then 
21:           cr.last_state = dfa[cs][c] 
22:         end if 
23:         cs = dfa[cs][c] 
24:         if cs in F then 
25:           switch cs do 
26:             case 117 
27:               cr.final_states[0]++    ▷ agggtaaa | tttaccct is found 
28:             case 122 or 128 
29:               cr.final_states[1]++    ▷ (c|g|t)gggtaaa | tttaccc(a|c|g) is found 
30:             ... 
31:         end if 
32:         if psi_i == 0 and char_i < steps then 
33:           FR fr 
34:           fr.current_state = cs 
35:           fr.final_states[0] = cr.final_states[0] 
36:           ... 
37:           fr.final_states[8] = cr.final_states[8] 
38:           fr_list.add(fr) 
39:         else if psi_i > 0 and fr_list[char_i].current_state == cs then    ▷ check for convergence 
40:           cr.final_states[0] += results[t_i][0].final_states[0] - fr_list[char_i].final_states[0] 
41:           ... 
42:           break 
43:         end if 
44:       end for 
45:       results[t_i].add(cr) 
46:     end for 
47:   end for 
48: end procedure 

Input: transition table dfa; the first character of the input mapped to T_i (current 
thread) first_char; the last character of the input mapped to T_{i-1} (previous 
thread) last_char 
Output: list of states 
49: procedure GET_POSSIBLE_STARTING_STATES(dfa, first_char, last_char) 
50:   S = L = list<q>                 ▷ q - state of the DFA 
51:   for q_0...q_n do 
52:     if dfa[q_i][first_char] ∈ Q then    ▷ Q - list of states 
53:       S_i = q_i 
54:     end if 
55:     if dfa[q_i][last_char] ∈ Q then 
56:       L_i = dfa[q_i][last_char] 
57:     end if 
58:   end for 
59:   return S ∩ L 
60: end procedure 
61: struct CR { 
62:   int init_state 
63:   int last_state 
64:   int final_states[9]             ▷ Stores the number of occurrences for each sub-expression (E1...E9) 
65: } 
66: struct FR { 
67:   int current_state 
68:   int final_states[9] 
69: } 


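TABLE I 
The process of splitting the input, performing the REM 
starting from each possible initial state, and the process of 
convergence 

a) Splitting the input string into chunks (chunk 1 and chunk 2, symbol positions 0-9 within each chunk) 

b) REM for chunk 1 (start state, followed by the state reached after each symbol) 

REM   Start   States visited 
R0    1       7 21 31 1 7 19 2 9 1 7 
R1    6       16 21 
R2    18      32 48 13 1 
R3    43      63 21 

c) REM for chunk 2 (state reached after each symbol) 

Row   States visited 
1     2 2 2 10 11 130 40 60 15 1 
2     19 2 
3     67 90 112 129 1 8 11 5 15 
4     84 105 2 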

TABLE II 
A tabular representation of the fr_list (Line 12) 

Step   CS    E1  E2  E3  E4  E5  E6  E7  E8  E9 
1      67    0   0   0   0   0   0   0   0   0 
4      129   0   0   0   0   0   1   0   0   0 


In order to properly count the number of occurrences of 
k-mers when a converging point is met, we need to store the 
number of occurrences of k-mers for the first $n$ steps while 
performing $R_0$. For each of the first $n$ examined characters 
an FR structure is created and stored in the fr_list (Line 32). 
The FR structure (Algorithm 1, Line 66) stores the current state 
of the automaton and the current number of occurrences for 
each of the sub-patterns. An example of this process is depicted 
in Table II, where each row represents an FR structure. In this 
example, we assume that $q_{44}$ is the first state in the PSS, 
which means that $R_0$ starts from $q_{44}$. After four 
characters (CCCT) are examined, a final 


TABLE III 
The tabular representation of the results 

T   Q0   Qn    E1  E2  E3  E4  E5  E6  E7  E8  E9 
1   1    130   0   0   0   1   0   0   0   0   0 
1   96   130   0   0   0   1   0   0   0   0   0 
2   3    130   0   0   1   0   1   0   2   0   1 
2   44   130   0   0   1   0   1   1   2   0   1 

T - thread index 
Q0 - start state, Qn - end state 
E1-E9 - sub-expressions 
CS - current state 





























































































state ($q_{129}$, representing $E_6$; see Table Ic, row 3) is 
reached, therefore the current number of occurrences of k-mers 
for $E_6$ becomes 1 (see Table II). 

The REMs starting from the remaining states of the PSS are 
performed only until the converging point is reached. For 
instance (see Table Ib), $R_1$, $R_2$ and $R_3$ need only 2, 4 
and 2 characters to be examined, respectively. When the 
converging point is reached, the total number of final states 
(CR.final_states) is calculated by adding the total number of 
final states found for $R_0$ (results[t_i][0].final_states) to 
those found so far for $R_i$, and subtracting the final states 
that $R_0$ had found at the converging point 
(fr_list[char_i].final_states) (Lines 40-41). 
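
The count correction at the converging point can be checked on a small example. The sketch below is an illustration under assumptions: it uses a hypothetical three-state automaton for the toy pattern "ac" and a single counter instead of the nine counters of the CR structure, and the function names `full_run` and `run_until_convergence` are not taken from the implementation.

```python
# Assumed toy transition table; state 2 is the only final state.
TABLE = [
    {"a": 1, "c": 0, "g": 0, "t": 0},
    {"a": 1, "c": 2, "g": 0, "t": 0},
    {"a": 1, "c": 0, "g": 0, "t": 0},
]
FINAL = {2}


def full_run(start, text):
    """Complete REM; returns the visited states and the running match count."""
    state, hits = start, 0
    states, counts = [], []
    for symbol in text:
        state = TABLE[state][symbol]
        hits += state in FINAL
        states.append(state)
        counts.append(hits)
    return states, counts


def run_until_convergence(start, text, r0_states, r0_counts, steps=10):
    """REM that stops as soon as it meets R0 within the first `steps` symbols.

    total = own matches so far + (R0 total - R0 matches at that point)
    """
    state, hits = start, 0
    for i, symbol in enumerate(text):
        state = TABLE[state][symbol]
        hits += state in FINAL
        if i < steps and r0_states[i] == state:
            return hits + r0_counts[-1] - r0_counts[i]
    return hits   # no convergence within `steps`: the run was completed


TEXT = "cacgtac"
R0_STATES, R0_COUNTS = full_run(0, TEXT)
```

Here the run that starts in state 1 meets $R_0$ after two symbols and obtains the same total as a complete run from state 1 would.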

IV. Experimental Evaluation 

In this section we describe the experimentation environment 
used for the evaluation of our proposed algorithm and we 
discuss the obtained performance results. 


A. Experimentation Environment 

We have performed experiments on a shared-memory 
system with two 12-core Intel Xeon processors of the type 
E5-2695 v2 and 16GB of memory. In total the system has 24 
physical cores, and each physical core supports two threads 
(also known as logical cores). We have implemented our 
algorithm using OpenMP. 

For compilation we used the Intel Compiler icc 15.0.0. In 
order to address the variability in performance measurements, 
we have repeated each experiment 20 times. 

For our experimental evaluation we have selected data-sets 
of genomes from the GenBank National Center for 
Biotechnology Information sequence database: mouse (2.7GB), 
cat (2.4GB), dog (2.4GB), chicken (1GB), human (3.2GB) 
and turkey (0.2GB). The information about the data-sets is 
provided in Table IV. 


TABLE IV 
DNA data-sets 

Organism   Genome Reference       Size (MB) 
Mouse      GRCm38.p2              2830 
Cat        Felis_catus-6.2        2490 
Dog        CanFam3.1              2440 
Chicken    Gallus_gallus-4.0      1060 
Human      GRCh38                 3250 
Turkey     Meleagris_gallopavo    193 


In our experiments we used the same patterns as in the 
regex-dna benchmark, listed in Table V, which are used 
to extract and match DNA 8-mers and substitute nucleotides 
according to the standards of the International Union of 
Biochemistry (IUB). In Section IV-B we compare the 
performance of the regex-dna benchmark with our k-mers 
CoEx algorithm. 

The DFA for the given regular expressions (Table V) was 
generated using our PaREM tool. Figure 3 depicts the 
DFA of 137 states, which is able to find the occurrences 
(including the overlapping ones) of the selected patterns. For 
simplicity the failure links are omitted from the DFA graph. 


TABLE V 
Patterns of the regex-dna benchmark. The symbol "|" denotes 
the "OR" regex operator 

E1   agggtaaa | tttaccct 
E2   (c|g|t)gggtaaa | tttaccc(a|c|g) 
E3   a(a|c|t)ggtaaa | tttacc(a|g|t)t 
E4   ag(a|c|t)gtaaa | tttac(a|g|t)ct 
E5   agg(a|c|t)taaa | ttta(a|g|t)cct 
E6   aggg(a|c|g)aaa | ttt(c|g|t)ccct 
E7   agggt(c|g|t)aa | tt(a|c|g)accct 
E8   agggta(c|g|t)a | t(a|c|g)taccct 
E9   agggtaa(c|g|t) | (a|c|g)ttaccct 
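
For small inputs, the nine patterns of Table V can be cross-checked with an ordinary regular expression library. The sketch below uses Python's standard `re` module and character classes in place of the alternations; it is a reference check under these assumptions, not the RE2-based benchmark or the DFA-based algorithm of this paper, and the name `count_occurrences` is hypothetical.

```python
import re

# The patterns of Table V, with (x|y|z) written as the character class [xyz].
PATTERNS = {
    "E1": "agggtaaa|tttaccct",
    "E2": "[cgt]gggtaaa|tttaccc[acg]",
    "E3": "a[act]ggtaaa|tttacc[agt]t",
    "E4": "ag[act]gtaaa|tttac[agt]ct",
    "E5": "agg[act]taaa|ttta[agt]cct",
    "E6": "aggg[acg]aaa|ttt[cgt]ccct",
    "E7": "agggt[cgt]aa|tt[acg]accct",
    "E8": "agggta[cgt]a|t[acg]taccct",
    "E9": "agggtaa[cgt]|[acg]ttaccct",
}


def count_occurrences(sequence):
    """Count the matches of each pattern, overlapping ones included.

    The zero-width lookahead (?=...) lets the search restart at every
    position, so overlapping matches are not skipped.
    """
    return {
        name: sum(1 for _ in re.finditer("(?=(?:%s))" % pattern, sequence))
        for name, pattern in PATTERNS.items()
    }
```

Such a check scans the input once per pattern, which is acceptable for short test strings only.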



Fig. 5. Speedup of our k-mers CoEx algorithm implementation. 


B. Results 

We first present the performance results of our k-mers 
CoEx algorithm for various problem and machine sizes, and 
thereafter we compare our algorithm with a pattern-based 
algorithm implementation that uses the RE2 library (known as 
the regex-dna benchmark). Figure 4 depicts the execution time 
in logarithmic scale for each of our selected data-sets and for 
various numbers of threads {1, 2, 6, 12, 24, 48}. We observe 
good scalability of our algorithm as we increase the number 
of threads or the input size. For example, the analysis of the 
human DNA sequence using one thread takes 27 seconds, 
and by increasing the number of threads to 2, 6, 12, 24 and 
48 the execution time is reduced to 14.3s, 5.3s, 3s, 2s and 
1.5s, respectively. 
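
The relation between these execution times and the speedup can be reproduced with a few lines of Python. The sketch uses the rounded times quoted in the text, so the ratios are only approximate: the rounded values give 18.0 for 48 threads, whereas the paper reports a maximal speedup of 17.65x computed from the unrounded measurements. The names `speedups` and `efficiencies` are illustrative.

```python
# Execution times in seconds for the human data-set, as quoted in the text.
REPORTED_SECONDS = {1: 27.0, 2: 14.3, 6: 5.3, 12: 3.0, 24: 2.0, 48: 1.5}


def speedups(times):
    """Speedup relative to the single-thread run: T(1) / T(p)."""
    base = times[1]
    return {threads: base / seconds for threads, seconds in times.items()}


def efficiencies(times):
    """Parallel efficiency per thread: speedup(p) / p."""
    return {threads: s / threads for threads, s in speedups(times).items()}


SPEEDUPS = speedups(REPORTED_SECONDS)
EFFICIENCIES = efficiencies(REPORTED_SECONDS)
```

With 48 threads on 24 physical cores, the efficiency per thread is below one half, which is expected because two logical cores share one physical core.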

Figure 5 depicts the obtained speedup of our algorithm 
compared to a sequential version of the Aho-Corasick 
algorithm for DNA analysis. We may observe that the k-mers 
CoEx algorithm scales gracefully with respect to the size of 
the data-sets and the number of threads. The maximal speedup 
of 17.65x is achieved for the largest data-set (that is, the 
human DNA segment) using 48 threads. 

Figure 6 compares the performance of our k-mers CoEx 
algorithm with the regex-dna benchmark, which is 
implemented in C++ using the RE2 library and OpenMP. 
The RE2 implementation is based on splitting the pattern into 
smaller patterns, and matching the input string in parallel for 
each sub-pattern. Since the regex-dna benchmark does not 

















Fig. 3. The DFA that matches 8-mers (counts and extracts the locations of k-mers) in a given DNA sequence. The corresponding regular 
expressions are shown in Table V. For simplicity the failure links are omitted from the figure. 

Fig. 4. Performance results of our k-mers CoEx algorithm for various numbers of threads and data-sets. Six DNA sequences of various 
lengths are used as input: mouse (2.7GB), cat (2.4GB), dog (2.4GB), chicken (1GB), human (3.2GB), and turkey (0.2GB). The experiments are performed by varying the 
number of threads (2, 6, 12, 24, and 48). The performance measurements for each experiment have been repeated 20 times. 


support larger data-sets, we have compared the two algorithms 
for the two smallest data-sets: chicken (1060MB) and turkey 
(193MB). Our k-mers CoEx algorithm outperforms the 
regex-dna benchmark for both data-sets. We may observe that 
the k-mers CoEx algorithm running on one thread takes the 
same amount of time as the regex-dna benchmark running on 
all available threads. This happens because one thread has 
to perform at least one sequential REM for a specific 
sub-pattern. In this class of algorithms, where the execution 
time mainly depends on the length of the input, the work 
should be balanced among the available threads by splitting 
the input string rather than the patterns. Partitioning a long 
pattern into smaller ones can be beneficial in cases where the 
input string is either relatively short or cannot be split 
(for example, in real-time network intrusion detection). 

V. Related Work 

In this section we discuss the state of the art in pattern 
matching and DNA sequence analysis techniques for 
multi-core architectures. 

Existing approaches use both hardware and software to 
accelerate the process of regular expression matching. In 
comparison to hardware-based state machines, which are 
faster, less flexible and more expensive, software-based 
acceleration techniques are flexible in terms of updating or 
adding new patterns [30]. 

Herath et al. presented an implementation of the 
Aho-Corasick string matching algorithm using POSIX threads, 






































































Fig. 6. Performance comparison between k-mers CoEx and the regex-dna 
benchmark (RE2) 


which is based on the pattern partitioning approach. A 
replication of Herath's study, with the intention of improving 
the software implementation of the Aho-Corasick algorithm, 
was conducted by Arudchutha et al. 

Marçais and Kingsford present the Jellyfish tool, which 
is based on a lock-free hash table that is optimized for 
counting k-mers of length up to 31 bases. Rizk et al. 
present DSK, an approach similar to Jellyfish that is 
designed for small-memory servers. The k-mers are 
counted by traversing the hash tables. Using hash tables for the 
internal representation turned out to be memory inefficient. 
As described by Drews et al. [14], a sequence corresponding 
to a human chromosome with 24-230MB of input data would 
require gigabytes of memory to store the k-mer information. 

Drews et al. achieved significant speedup by partitioning 
the input string among the threads in such a way that 
each thread processes only sequences starting with a specified 
prefix, which is used to divide the radix tree among the 
threads. They achieved up to 6.9x speedup on a shared-memory 
system with 8 cores. 

The n-step FM-index approach presented by Chacon et al. 
achieved speedups from 1.4x to 2.4x with respect to 
their original FM-index search algorithm. 

An approach based on the Aho-Corasick string matching 
algorithm designed for the Cray XMT architecture is proposed 
by Villa et al. They split the input among the available 
threads, and overlap the input by the pattern length. Their 
approach is applicable to multiple patterns as long as they are 
of the same length; otherwise, the occurrences of the shortest 
patterns occurring on the crossing border may be counted by 
both threads. 
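
The overlap idea can be illustrated for a single pattern length. The sketch below is an assumption-based illustration and not the Cray XMT implementation: each chunk is extended by m - 1 symbols (m being the pattern length), which is one way to make sure that a match crossing a border is seen exactly once, namely by the chunk in which it starts. The names `count_overlapping` and `chunked_count` are hypothetical.

```python
def count_overlapping(text, pattern):
    """Count all occurrences of `pattern` in `text`, overlaps included."""
    last_start = len(text) - len(pattern)
    return sum(text.startswith(pattern, i) for i in range(last_start + 1))


def chunked_count(text, pattern, n_chunks):
    """Count matches chunk by chunk, extending every chunk by m - 1 symbols."""
    m = len(pattern)
    size = len(text) // n_chunks
    total = 0
    for chunk_id in range(n_chunks):
        start = chunk_id * size
        end = len(text) if chunk_id == n_chunks - 1 else start + size
        # A match is counted by the chunk that contains its first symbol.
        window = text[start:min(len(text), end + m - 1)]
        total += count_overlapping(window, pattern)
    return total
```

If patterns of different lengths were searched with an overlap chosen for the longest one, a shorter pattern lying entirely inside the overlap region would be seen by two chunks, which corresponds to the double counting described above.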

A method for searching arbitrary regular expressions using 
speculation is proposed by Luchaup et al. The drawback 
is that if an REM performed by a thread does not converge 
on its sub-input, then the next thread has to start from a new 
state, which breaks the serialization and limits the scalability. 

Our k-mers CoEx algorithm is tailored for large-scale DNA 


analysis. In our approach, the DNA segment is split into 
several chunks, and efficient speculations of the possible 
initial states for each chunk are performed. Furthermore, our 
algorithm optimizes the REM using a converging point. 

VI. Summary 

We have described a parallel algorithm based on Finite 
Automata for counting and extracting k-mers in a DNA 
segment. In a series of experiments with real-world data-sets 
we have observed that the algorithm scales well with 
respect to various problem and machine sizes. We achieved 
the maximal speedup of 17.65x for the largest data-set (that 
is, the human DNA segment) using 48 threads on a dual-socket 
shared-memory system with 24 physical cores. In comparison 
to the regex-dna benchmark our algorithm was up to three 
times faster. 

In this paper we have studied the performance of our 
approach for DNA sequence analysis on a shared-memory 
system with two 12-core Intel Xeon processors. It may be 
useful to compare the performance that we achieved using 
all available cores of the host Intel Xeon processors with the 
performance achieved when all available cores of the Intel 
Xeon Phi coprocessor are used. Furthermore, available 
software technologies enable the use of all cores of the 
homogeneous processors of the host together with all 
available cores of the coprocessor. 

Future work could address the generalization of our approach 
for DNA sequence analysis to various types of accelerated 
systems using techniques that ensure performance portability 
[24], [29]. The use of modeling and simulation 
techniques [10], [15] could help to reason about the 
performance on extreme-scale computing architectures. 